H2Opus: a distributed-memory multi-GPU software package for non-local operators
Authors
Abstract
Hierarchical ${\mathscr{H}}^{2}$-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their O(N) complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed operations on these matrices to allow large-scale problems to be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the ${\mathscr{H}}^{2}$ format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of hierarchical matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability up to 16M degrees of freedom problems on 64 GPUs.
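The memory and operator-application savings mentioned in the abstract come from storing admissible off-diagonal blocks in compressed, low-rank form. The following is a minimal sketch of that single idea in plain Python, not the H2Opus API: all function names (`low_rank_matvec`, `dense_matvec`, `outer_product_block`) and the tiny matrices are hypothetical, and the nested bases, tree marshaling, batching, and MPI communication of the actual package are not modeled.

```python
def dense_matvec(A, x):
    """Reference product y = A x for a dense block A (list of rows)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def low_rank_matvec(U, V, x):
    """Apply the block U V^T to x without ever forming the dense block.

    U is m x k and V is n x k, so the cost is O(k (m + n))
    instead of O(m n) for the dense product.
    """
    k = len(U[0])
    # Step 1: project x onto the k columns of V, t = V^T x.
    t = [sum(V[j][r] * x[j] for j in range(len(x))) for r in range(k)]
    # Step 2: expand with U, y = U t.
    return [sum(U[i][r] * t[r] for r in range(k)) for i in range(len(U))]


def outer_product_block(U, V):
    """Form the dense m x n block U V^T (only used here for checking)."""
    k = len(U[0])
    return [[sum(U[i][r] * V[j][r] for r in range(k)) for j in range(len(V))]
            for i in range(len(U))]


# Hypothetical rank-2 block of size 3 x 4.
U = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
V = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0], [2.0, 1.0]]
x = [1.0, 2.0, 3.0, 4.0]

y_fast = low_rank_matvec(U, V, x)
y_ref = dense_matvec(outer_product_block(U, V), x)
print(y_fast)  # [12.0, 33.0, 27.0]
```

For a block with rank k much smaller than its dimensions, the factored product touches far fewer entries than the dense one, which is the basic reason hierarchical formats scale to large problem sizes.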
Similar resources
BEANS - a software package for distributed Big Data analysis
BEANS software is a web based, easy to install and maintain, new tool to store and analyse data in a distributed way for a massive amount of data. It provides a clear interface for querying, filtering, aggregating, and plotting data from an arbitrary number of datasets. Its main purpose is to simplify the process of storing, examining and finding new relations in the so-called Big Data. Creatio...
A Portable 3D FFT Package for Distributed-Memory Parallel Architectures
1 Introduction. Multidimensional FFTs are used frequently in engineering and scientific calculations, especially in image processing. Parallel implementations of FFT generally follow two approaches. One is the binary-exchange approach [1,2], where data exchanges take place in all pairs of processors with processor numbers differing by one bit. Another one is the transpose approach[...
A Distributed Multi-GPU System for Fast Graph Processing
We present Lux, a distributed multi-GPU system that achieves fast graph processing by exploiting the aggregate memory bandwidth of multiple GPUs and taking advantage of locality in the memory hierarchy of multi-GPU clusters. Lux provides two execution models that optimize algorithmic efficiency and enable important GPU optimizations, respectively. Lux also uses a novel dynamic load balancing st...
Distributed Software Transactional Memory
This report describes an implementation of a distributed software transactional memory (DSTM) system in PLT Scheme. The system is built using PLT Scheme’s Unit construct to encapsulate the various concerns of the system, and allow for multiple communication layer backends. The front-end API exposes true parallel processing to PLT Scheme programmers, as well as cluster-based computing using a sh...
GPU-Based Multi-start Local Search Algorithms
In practice, combinatorial optimization problems are complex and computationally time-intensive. Local search algorithms are powerful heuristics which allow to significantly reduce the computation time cost of the solution exploration space. In these algorithms, the multistart model may improve the quality and the robustness of the obtained solutions. However, solving large size and time-intens...
Journal
Journal title: Advances in Computational Mathematics
Year: 2022
ISSN: 1019-7168, 1572-9044
DOI: https://doi.org/10.1007/s10444-022-09942-6